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The  Parallel  Diagonal  Dominant  (FDD)  algorithm  is  a  highly  efficient,  ideally  scalable  tridiago¬ 
nal  solver.  In  this  paper,  a  detailed  study  of  the  PDD  algorithm  is  given.  First  the  PDD  algorithm 
is  introduced.  Then  the  algorithm  is  extended  to  solve  periodic  tridiagonal  systems.  A  variant, 
the  reduced  PDD  algorithm,  is  also  proposed.  Accuracy  analysis  is  provided  for  a  class  of  tridi- 
agonal  systems,  the  symmetric  and  anti-symmetric  ToepUtz  tridiagonal  systems.  Implementation 
results  show  that  the  analysis  gives  a  good  bound  on  the  relative  error,  and  the  algorithm  is  a  good 
candidate  for  the  emerging  massively  parallel  machines. 
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1  Introduction 


Solving  tridiagonal  systems  is  one  of  the  key  issues  in  computational  fluid  dynamics  (CFD)  and 
many  other  scientific  applications  [21, 13].  Many  methods  used  for  the  solution  of  partial  differential 
equations  (PDEs)  rely  on  solving  a  sequence  of  tridiagonal  systems.  The  alternating  direction 
implicit  (ADI)  method,  the  most  widely  used  implicit  method  for  PDEs  [17],  solves  PDEs  by  solving 
tridiagonal  systems  alternately  in  each  coordinate  direction.  Discretization  of  partied  differential 
equations  by  compact  difference  schemes  also  leads  to  a  sequence  of  tridiagonal  systems.  Tridiagonal 
systems  also  arise  in  multigrid  methods  and  in  ADI  or  line-SOR  preconditioners  for  conjugate 
gradient  methods.  In  addition  to  solving  PDE’s,  tridiagonal  systems  also  arise  in  many  other 
applications  [1]. 

Solving  tridiagonal  systems  is  inexpensive  on  sequential  machines.  However,  because  of  their 
serial  nature,  tridiagonal  systems  are  difficult  to  solve  efficiently  on  parallel  computers.  Thus 
intensive  research  has  been  done  on  the  development  of  efficient  parallel  tridiagonal  solvers.  Many 
algorithms  have  been  proposed  [14,  8].  Among  them,  the  recursive  doubling  reduction  method 
(RCD),  developed  by  Stone  [16],  and  the  cyclic  reduction  or  odd-even  reduction  method  (OER), 
developed  by  Hockney  [9],  are  able  to  solve  an  n-dimensional  tridiagonal  system  in  0{log  n)  time 
using  n  processors.  These  are  effective  algorithms  for  fine  grained  computing.  Later,  several 
algorithms  were  proposed  for  median  and  coarse  grain  computing,  i.e.  for  the  case  of  p  <  n  or 
p  «  n,  where  p  is  the  number  of  processors  available  [5,  11,  22].  The  algorithm  given  by  Lawrie 
and  Sameh  [11]  and  the  algorithm  given  by  Wang  [22]  can  be  considered  substructured  methods. 
1  tiese  algorithms  partition  the  original  problem  into  sub-problems.  The  sub-problems  are  solved  in 
parallel,  and  then  the  solutions  of  the  sub-problems  are  combined  to  obtain  the  final  solution.  All 
of  these  parallel  tridiagonal  solvers  increase  parallelism  by  adding  additional  computation.  They 
trade  increased  work  for  reduced  communication  overhead  and  better  load  balance  and  have  a 
larger  operation  count  than  the  best  sequential  algorithm. 

Recently,  Sun,  Zhang,  and  Ni  [21]  have  proposed  three  parallel  algorithms  for  solving  tridiagonal 
systems.  All  of  these  algorithms  are  based  on  Sherman-Morrison  matrix  modification  formula  [3]. 
The  parallel  partition  LU  (PPT)  algorithm  and  the  parallel  hybrid  (PPH)  algorithm  are  fast  and 
able  to  incorporate  limited  pivoting.  The  PPT  algorithm  is  a  good  candidate  when  the  number  of 
processors,  p,  is  small.  The  PPH  algorithm  is  a  better  choice  when  p  is  large.  Finally,  for  diagonal 
dominant  problems,  the  (PPD)  algorithm  is  the  most  efficient. 

Compared  with  other  tridiagonal  solvers,  which  all  have  at  least  0(log  p)  communication  cost, 
the  PDD  algorithm  has  only  a  small  fixed  communication  cost  and  a  small  amount  of  additional 
computation.  In  fact,  the  PDD  algorithm  is  perfectly  scalable,  in  the  the  sense  that  the  communi¬ 
cation  cost  and  the  computation  overhead  do  not  increase  with  the  problem  size  or  with  the  number 
of  processors  available. 
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Modern  technological  advances  have  made  it  possible  to  build  computers  containing  more  and 
more  processors.  Multiprocessors  with  hundreds  or  thousands  of  processors  are  commercially  avail¬ 
able.  Recent  parallel  computers,  such  as  the  Intel  Paragon,  Thinking  Machine  Corporalions’s 
CM-5,  and  Cray’s  MPP,  highlight  the  use  of  high-density  and  high-speed  processor  and  memory 
chips  based  on  ultra-large-scale  integration  (ULSI)  or  very  high-speed  integrated  circuits  ( VHSlCs). 
With  this  new  technology,  64-bit  150-MHz  microprocessors,  for  example,  are  now  available  on  a 
single  chip  having  l.O-million  transistors  [10].  The  emerging  parallel  computers  build  on  such  mi¬ 
croprocessors  are  noted  for  their  scalable  architecture  and  mzissively  parallel  processing.  They  are 
designed  for  grand  challenge  applications  which  could  not  otherwise  be  tackled. 

Scalability  has  become  an  important  metric  of  parallel  algorithms  [6,  20,  19].  Its  perfect  scala¬ 
bility  and  high  efficiency  make  the  PDD  algorithm,  when  applicable,  an  ideal  choice  on  these  new 
machines.  However,  the  PDD  algorithm  is  relatively  new  and  applicable  only  under  certain  condi¬ 
tions.  In  this  paper  we  give  a  detailed  study  of  the  PDD  algorithm.  We  study  the  applications  of 
the  PDD  algorithm  and  provide  a  formal  accuracy  analysis  for  Toeplitz  tridiagonal  systems.  The 
PDD  algorithm  proposed  in  this  paper  is  slightly  different  from  the  algorithm  proposed  in  [21]. 
Extended  study  is  provided  for  different  applications,  such  as  periodic  systems,  and  systems  with 
multiple  right-sides.  The  reduced  PDD  algorithm  is  also  proposed.  Simple  formulas  are  provided 
for  accuracy  checking  for  symmetric  and  anti-symmetric  Toeplitz  tridiagonal  systems. 

This  paper  is  organized  as  follows.  Section  2  will  introduce  the  sequential  and  parallel  PDD 
algorithms.  The  appheations  of  the  PDD  algorithm  will  be  discussed  in  Section  3.  This  section  will 
also  give  the  variant  of  the  PDD  algorithm  for  periodic  systems  and  the  reduced  PDD  algorithm. 
Section  4  will  give  the  accuracy  study  for  the  PDD  algorithm  and  the  reduced  PDD  algorithm. 
Experimental  results  on  an  Intel/860  multicomputer  will  be  presented  in  Section  .5.  Finally,  Section 
6  gives  the  conclusion  and  final  remarks. 

2  The  Parallel  Diagonal  Dominant  Algorithm 

We  are  interested  in  solving  a  tridiagonal  linear  system  of  equations 


Ax  =  d 


(1) 


wh^'re  A  is  a  tridiagonal  matrix  of  order  n 


A  = 


6o  Co 
Oi  bi  C] 
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=  [ao^oc,] 
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X  =  (xo, •  • and  d  =  (do, •  •  •, dn-i)^-  We  assume  that  A,x,  and  d  have  real  coefficients. 
Extension  to  the  complex  case  is  straightforward. 


2.1  The  Matrix  Partition  Technique 

To  solve  Eq.  (1)  efficiently  on  parallel  computers,  we  partition  A  into  submatrices.  For  convenience 
we  assume  that  n  =  pm.  The  matrix  A  in  Eq.  (2)  can  be  written  as 

A  =  A  +  AA,  (3) 


where 
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The  submatrices  Ai(t  =  0,  •  •  -  jp- 1)  are  mxm  tridiagonal  matrices.  Let  e,  be  a  column  vector 
with  its  ith  (0  <  t  <  n  -  1)  element  being  one  and  all  the  other  entries  being  zero.  We  have 


I ,  C2rr»— iCjm— 1 ,  ■  ■  ■ ,  l] 


"m— 1 


'(p-l)m-l 


=  VE^ 


where  both  V  and  E  are  n  x  2(p  -  1)  matrices.  Thus,  we  have 

A  =  A  +  VE^. 


(4) 


(5) 


Based  on  the  matrix  modification  formula  originally  defined  by  Sherman  and  Morrison  [15]  for 
rank-one  changes  and  generalized  by  Woodbury  [23],  and  assuming  that  all  A,’s  are  invertible,  Eq. 
(1)  can  be  solved  by 

X  =  A-M  =  (A4-VET)-id 

=  A~M- A“^V(I-|-E'^A-^V)-lE'’^A-id. 
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(6) 

(7) 


Note  that  /  is  an  identity  matrix  and  Z  =  I  +  e'*^A  is  a  pentadiagonal  matrix  of  order  2(p- 1). 
Let 


Ax  =  d  (8) 

AY  =  V  (9) 

h  =  E^x  (10) 

Z  =  I+E'^Y  (11) 

Zy  =  h  (12) 

Ai  =  Yy.  (13) 

Equation  (7)  becomes 

X  =  X  -  Ax.  (14) 


In  Eq.s  (8)  and  (9),  x  and  Y  are  solved  by  the  LU  decomposition  method.  By  the  structure  of 
A  and  V,  this  is  equivalent  to  solve 

=  [d<‘^a,•meo,C(,+l)m-lem-l],  (15) 

t  =  0,  •  •  •  ,p—  1.  Here  and  are  the  ith  block  of  x  and  d,  respectively,  and  are  possible 

nonzero  column  vectors  of  the  ith  row  block  of  y.  Equation  (15)  implies  that  we  only  need  to  solve 
three  linear  systems  of  order  m  with  the  same  LU  decomposition  for  each  i  (i  =  0, •••,?  -  1). 
In  addition,  we  can  skip  the  first  m  —  1  forward  substitutions  for  the  third  system  since  the  first 
m  —  1  components  of  the  vector  at  the  right-hand  side  are  all  zeros.  There  is  no  computation  or 
communication  involved  in  computing  h  and  Z. 

2.2  The  PDD  Algorithm 

Solving  Eq.  (12)  is  the  major  computation  involved  in  the  conquer  part  of  our  algorithms.  Different 
approaches  have  been  proposed  for  solving  Eq.(12),  which  results  different  algorithms  for  solving 
tridiagonal  systems  [21].  The  matrix  Z  in  Eq.  (12)  has  the  form 


1  0 

•  1 

1 

where  for  *  =  0, 1  are  solutions  of  Eq.  (15)  and  the  I’s  come  from  the  identity 

matrix  I.  In  practice,  especially  for  a  diagonal  dominant  tridiagonal  system,  the  magnitude  of  the 


last  component  of  component  of  w^'\  Wq  \  may  be  smaller  than  machine 

accuracy  when  p  <<  n.  In  this  case,  Wq^  and  can  be  dropped,  and  Z  becomes  a  diagonal  block 
system  consisting  of  (p  —  1)  2  x  2  independent  blocks.  Thus,  Eq.(12)  can  be  solved  efficiently  on 
parallel  computers,  which  leads  to  the  highly  efficient  parallel  diagonal  dominant  (FDD)  algorithm. 

In  the  sequential  FDD  algorithm,  since  Y  has  at  most  two  nonzero  entries  in  every  row,  and  Z 
is  a  diagonal  block  matrix  with  I’s  as  diagonal  elements,  (12)  takes  five  arithmetic  operations  per 
row,  and  the  evaluation  of  (13)  takes  four  operations  per  row.  Based  on  the  above  observations, 
and  together  with  a  careful  scaling  process,  we  conclude  that  the  seouential  FDD  algorithm  takes 
17n  —  9^  —  4p  —  9  arithmetic  operations. 

Using  p  processors,  the  FDD  algorithm  consists  of  the  foUowing  steps: 


Step  1.  Allocate  Ai,d^*\  and  elements  to  the  Jth  node,  where  0  <  i  <  p  —  1. 

Step  2.  Solve  (15).  AH  computations  can  be  <*xecuted  in  parallel  on  p  processors. 

Step  3.  Send  from  the  ith  node  to  the  (i  —  l)th  node,  for  i  =  1,  •  •  -  jp  —  1. 

Step  4.  Solve 


w. 


(*) 
m— 1 

1 


(17) 


in  parallel  on  the  ith  node  for  0  <  t  <  p  -  2.  Then  send  j/2i  from  the  ith  node  to  the  (r  +  l)th 
node,  for  f  =  0,  •  -sp-  2. 


Step  5.  Compute  (13)  and  (14).  We  have 


(18) 


(19) 


In  all  of  these,  one  has  only  two  neighboring  communications. 

Communication  cost  is  an  overhead  of  parallelism.  Recent  advanced  communication  mecha¬ 
nisms,  such  as  circuit  switching  and  wormhole  routing,  have  reduced  communication  delay  con¬ 
siderably.  However,  compared  with  the  improvement  of  processing  speed,  the  improvement  of 
communication  speed  is  relatively  small.  Communication  cost  ha.s  a  great  impact  on  overall  per¬ 
formance.  Empirically,  for  most  distributed-memory  computers,  the  communication  time  for  a 
neighboring  communication  is  a  linear  function  of  the  problem  size  [4].  Let  S  be  the  number  of 
bytes  to  be  transferred.  Then  the  transfer  time  of  a  neighboring  communication  can  be  expressed 
as  a  4-  S/3,  where  a  represents  a  fixed  startup  overhead  and  /?  is  the  incremental  transmission  time 
per  byte.  Assuming  4  bytes  are  used  for  each  real  number.  Step  3  and  Step  4  take  a  +  8j3  and  a +4/3 
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communication  respectively.  The  parallel  FDD  algorithm  needs  17^-4  parallel  computation  and 
2(a  +  6/?)  communication. 

2.3  Scalability  Analysis 

As  parallel  machines  have  been  built  with  more  and  more  processors,  the  performance  metric 
scalability  becomes  more  and  more  important.  Thus,  the  question  is  how  an  algorithm  will  perform 
when  the  problem  size  is  scaled  up  linearly  with  the  number  of  processors.  Let  T{p,W)  be  the 
execution  time  for  solving  a  system  with  W  work  (problem  size)  on  p  processors.  The  ideal  situation 
would  be  when  both  the  number  of  processors  and  the  amount  of  work  are  scaled  up  N  times,  the 
execution  time  remains  unchanged: 


T{N  xp,N  xW)  =  T{p,W) 


(20) 


How  one  should  define  problem  size,  in  general,  is  a  style  under  debate.  However,  it  is  commonly 
agreed  that  the  floating  point  (flop)  operation  count  is  a  good  estimate  of  problem  size  for  scientific 
computations.  To  eliminate  the  effect  of  numerical  inefficiencies  in  parallel  algorithms,  in  practice 
the  flop  count  is  based  upon  some  practical  optimal  sequential  algorithm.  In  our  case,  the  LU 
decomposition  has  chosen  as  the  sequential  algorithm.  It  takes  8n  -7  floating  point  operations, 
where  7  is  a  negligible  constant  number  when  n  is  large.  As  the  problem  size  W  increases  N  times 
to  W,  we  have 


=  iV  X  8n  =  8n' 
n'  =  N  -  n. 


(21) 


bet  Tcomp  represent  the  unit  of  a  computation  operation  normalized  to  the  communication  time. 
The  time  required  to  solve  (1)  by  the  FDD  algorithm  with  p  processors  is 


T{p,  W)  =  (17-  -  4)rcomp  +  2(a  +  6/?),  (22) 

P 

and 

TiNxp,NxW)  =  (17^  -  4)  Wp  +  2(q  +  6/?) 

=  (17^  -  4)rcomp  +  2(o  +  6;fl) 

=  (17f-4)rcomp  f  2(a  +  6/8) 

=  T{p,W). 

The  FDD  algorithm  has  the  ideal  scalability.  Similar  arguments  could  be  applied  to  periodic 
systems  (see  Section  3)  and  the  same  result  would  be  obtained. 

Using  the  isospeed  approach,  scalability  has  been  formally  defined  in  [20].  The  average  unit 
speed  is  defined  as  the  quotient  of  the  achieved  speed  of  the  given  computing  system  and  the  number 
of  processors.  Since  Eq.(20)  is  true  if  and  only  if  the  average  unit  speed  of  the  given  computing 
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system  is  a  constant,  the  scalability  is  defined  as  the  ability  to  maintain  the  average  unit  speed 
[20].  Let  W  be  the  amount  of  work  of  an  algorithm  when  p  processors  are  employed  in  a  machine, 
and  let  W  be  the  amount  of  work  of  the  algorithm  when  p'  =  N  ■  p  processors  are  employed  to 
maintain  the  average  speed,  then  the  scalability  from  system  size  p  to  system  size  N  •  p  of  the 
algorithm-machine  combination  is  defined  as 


ipip,  N  xp)  = 


N -p-W  N-W 
p-W'  ~  W  ‘ 


The  average  unit  speed  can  be  represented  as 


A^(p,W)  = 


p-T{p,wy 


where  W  is  the  problem  size,  p  is  the  number  of  processors,  and  T{p,W)  is  the  corresponding 
execution  time.  From  our  early  discussion,  for  the  FDD  algorithm,  when  W  —  N  ■  W,  we  have 
T{N  X  p,  W)  =  r(p,  W).  Therefore 


A^{N  X  p,lF'')  = 


W  _  W  N-W  W 

N  •  T(N  X  p,  W')  ~  ~N  •  T(p,  W)  ~  N  •  T(p,  W)  ~  T(p,wy 


That  isW'  =  N-  W  has  maintained  the  average  unit  speed,  and  the  scalability  is 


V»(p,iVxp)  =  -^=r 


N-W  N-W 
W'  ~  N-W 


It  is  the  ideal  scalability. 


3  Special  Applications 

In  this  section,  we  first  discuss  some  tridiagonal  systems  arising  in  CFD  applications,  the  symmetric 
and  antisymmetric  Toeplitz  tridiagonal  systems.  Then  two  variants  of  the  FDD  algorithm,  the 
reduced  FDD  algorithm  and  the  FDD  algorithm  for  periodic  systems,  will  be  presented. 

3.1  Toeplitz  Tridiagonai  Systems 

A  Toeplitz  tridiagonai  matrix  has  the  form 

be 
a  b  c 

A=  ...  =[a,b,c].  (28) 

.  .  c 

a  b 
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Symmetric  Toeplitz  tridiagonal  systems  are  often  arise  in  solving  partial  differential  equations  and 
in  other  scientific  applications.  Compact  finite  difference  scheme  [12]  is  a  relative  new  scheme  for 
solving  PDE’s.  Because  of  its  simplicity  and  high  accuracy,  it  has  been  widely  used  in  practice. 
Using  the  compact  scheme,  the  general  approximation  of  a  first  derivative  has  the  form; 


at'  I  ^  t'  I  t'  I  ^  t'  I  a  r'  _  «/*+3  ft— 3  ,  i_/i'+2  ft— 3  ,  _/i+l 

Pfi-2  +  °‘fi-l  +  ft  +  ®/i+l  +  Pfi+2  -  ^ ® - 1-  ^ 


Letting 

a  =  =  0,a  =  ^,6=  i,c  =  0,  (29) 

the  scheme  becomes  formally  sixth  order  accurate  and  the  resulting  system  is  [|,  1,  |],  a  symmetric 
Toeplitz  tridiagonal  system.  Similarly,  the  general  approximation  of  a  second  derivative  has  the 
form 


at"  .  t"  .  t"  .  t"  .  at"  ,/i+3  -  2/;  + /,_3  /i+2  “  2/,  +  /._2  ,  /.+!  “  2/,  +  /,_ i 

+  +  /i  +«/i+i  +P/i+2  =  c - rr;; - +  6 - rr; - +  a- 


9h^ 


4/i2 


For 

a  =  ^,/?  =  0,a=l^,6=^,c=0,  (30) 

a  sixth  order  difference  scheme  is  obtained,  and  the  tridiagonal  system  is  sj-mmetric  and  Toeplitz, 
[^,  1, ^].  Discretized  in  time,  the  one  dimensional  wave  equation  Ut  =  a-Ux  and  the  heat  equation 
Ut  —  a  •  Uxx  can  be  represented  as 


=  u"  +  At  •  a  •  (31) 

and 

=  «"  +  At .  a  •  (32) 

respectively. 

Using  the  compact  scheme,  and  defined  by  symmetric  Toeplitz  tridiagonal  systems. 

Therefore,  the  solutions  can  be  obtained  by  solving  a  sequence  of  symmetric  Toeplitz  tridiagonal 
systems.  Using  ADI  methods  [17],  parabolic  and  hyperbolic  systems  can  be  solved  by  solving  a 
sequence  of  symmetric  Toeplitz  tridiagonal  systems. 

Anti-symmetric  Toeplitz  tridiagonal  systems  also  arise  in  solving  PDEs  [17].  For  instance,  to 
solve  the  wave  equation  Ut  a  ■  Ux  =  /,  we  begin  with  the  formula 

+  +  (33) 
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for  «t  evaluated  aA  {t  +  \k,x).  We  also  use  the  relation 


ttxO  +  i/c,x)  =  +  0{]fc2) 

=  1  +  0{Jb2)  +  0(/i2). 

Using  these  approximations  for  Uj  +  a  •  Ui;  =  /  about  (t  +  \k,x),  we  obtain 

-  <t\ + »“« -  <-i  /r‘  +  fi 

4h  “2 


or,  equivalently, 


aX 

T 


„n+l 

‘^m+1 


+  V. 


,n+l 


n+1 
m  — 1 


aX 

- rV. 


m+l 


+  V” 


+  ^<-1  +  |(/r' + r. 


m/ 


The  left  side  is  an  antisymmetric  Toeplitz  tridiagonal  matrix,  A  =  1,  — 


(34) 


(35) 


(36) 


3.2  Periodic  Tridiagonal  Systems 

Many  PDE’s  arisen  in  real  applications  i  ave  periodic  boundary  conditions.  For  instance,  to  study 
a  physical  phenomenon  of  a  large  object,  we  often  simulate  only  a  small  portion  of  it  and  then 
^■pply  periodic  boundary  conditions  on  each  of  the  portions.  The  resulting  linear  systems  have  the 
form  of 


Aq  Co 

oi  bi  Cl 


Go 


1 


(37) 


On-2  bn-2  Cn-2 
Cn— 1  On— 1  ^n— 1 

and  are  called  periodic  tridiagonal  systems.  On  sequential  machine,  periodic  tridiagonal  systems 
are  solved  by  combining  the  solutions  of  two  different  right-sides  [7],  which  increases  the  operation 
count  from  8n  —  7  to  14n  —  16, 

The  PDD  algorithm  can  be  extended  to  periodic  tridiagonal  systems.  The  difference  is  that, 
after  dropping  and  matrix  Z  becomes  a  periodic  system  of  order  2p: 


Z  = 


1  0 
0  1 


,,(0) 


(38) 
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The  dimension  of  Z  is  slightly  higher  than  in  the  non-periodic  case,  which  simply  makes  the  load 
on  the  0th  and  (p-l)th  processor  identicol  to  load  on  all  of  the  other  processors.  The  parallel 
computation  time  remains  the  same.  For  periodic  systems,  the  communication  at  step  3  and  4 
chEunges  from  one  dimensional  array  communication  to  ring  communication.  The  communication 
time  is  also  unchanged.  Figure  1  shows  the  communication  pattern  of  the  FDD  algorithm  for 
periodic  systems. 


Figure  1.  Communication  Pattern  for  Solving  Periodic  Systems. 


3.3  The  Reduced  PDD  Algorithm 

In  the  last  step,  Step  5,  of  the  PDD  algorithm,  the  final  solution,  x,  is  computed  by  combining  the 
intermediate  results  concurrently  on  each  processor, 

a:(')  =  x(‘)  -  (39) 


which  requires  4(n  —  1)  operations  in  total  and  4m  parallel  operations,  if  p  =  n/m  processors  are 
used.  The  PDD  algorithm  drops  off  the  the  first  element  of  w,wq,  and  the  last  element  of  u,  Um-i, 
in  solving  Eq.  (12).  In  Section  4.1  -  4.2,  we  will  show  that,  for  symmetric  and  anti-symmetric 
Toeplitz  tridiagonal  systems,  the  wq  and  Vm-i  can  be  dropped  when  m  is  large  with  the  accuracy 
of  the  final  solution  unaffected.  Further  more,  we  have 


■,  m— 1  m~l 

Ha  +  bZTJo  f>^'y  to  to  ^ 


(40) 


So,  when  m  is  large  enough,  we  may  drop  off  Vj,  *  =  y ,  •  •  • ,  m  -  1,  and  Wi,  i  =  0, 1,  •  ■  • ,  y  -  1,  while 
maintaining  the  same  accuracy.  If  we  replace  v,  by  u,-,  where  u,  =  Uj  for  i  =  0, 1,  ■  •  y  -  1,  u,  =  0, 
for  t  =  y ,  •  •  • ,  m  -  1;  and  replace  w  by  w,  where  —  Wi  for  i  =  y,  ■  •  • ,  m  —  1,  and  th,  =  0,  for 
j  =  0, 1,  •  •  •,  y  -  1;  and  use  v,  w  in  step  5,  we  have 
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Step  5’ 

Ax<‘>  =  ^  j  (41) 

j(0  =  fC)  _  ^x(0.  (42) 

It  requires  2|  parallel  operation.  Replacing  Step  5  of  the  FDD  algorithm  by  Step  5’,  we  get  the 
reduced  FDD  algorithm  which  requires  15^  —  4  parallel  computations. 

4  Accuracy  Analysis 

The  FDD  algorithm  is  highly  efficient,  perfectly  scalable,  but  it  is  only  applicable  when  the  inter¬ 
mediate  results  <  i  <  p  —  1,  can  be  dropped  out.  However  this  dropping  may  lead  to 

inaccurate  or  even  wrong  solution.  Thus  an  accuracy  study  is  essential  for  applying  the  FDD  al¬ 
gorithm.  Some  study  have  been  done  for  the  accuracy  of  the  FDD  algorithm.  Sufficient  conditions 
have  been  given  [24,  2].  However,  the  study  is  for  the  general  case.  The  conditions  given  in  [24] 
are  difficult  to  verify  and  the  accuracy  bound  is  large.  In  this  section  we  focus  on  a  particular  class 
of  tridiagonal  systems,  symmetric  and  antisymmetric  Toeplitz  tridiagonal  systems.  Our  analysis  is 
four  fold.  First,  we  give  the  decay  rate  of  =  0,  •  •  -  jp  -  1.  They  are  the  entries  treated 

as  zeros  by  the  FDD  algorithm.  Second,  the  accuracy  of  the  FDD  algorithm  is  studied.  Then, 
we  analyze  the  accuracy  of  the  reduced  FDD  algorithm.  All  of  the  above  three  analysis  are  for 
symmetric  Toeplitz  tridiagonal  systems.  Finally,  we  extend  the  results  to  anti-symmetri'"  Toeplitz 
tridiagonal  systems. 

4.1  The  Decay  Rate  of  Vm-i  and  wq 

Symmetric  Toeplitz  tridiagonal  systems  have  the  form  A  =  [A,/?,  A]  =  A[l,c,  1],  where  c-0fX.  We 
assume  the  matrix  A  is  diagonal  dominant.  That  is  |c|  >  2.  To  study  the  accuracy  of  the  solution 
of  i4i  =  6,  we  first  study  the  matrix 


II 

^  a  1 

1  c  1 

1  . 

f  1 
b  1 
b 

\ 

^  a  1 

a 

\ 

< 

.  1 

1  c  J 

\ 

b  1  ) 

\ 

.  1 
a  / 

where  a  and  6  are  the  real  solutions  of 

b  +  a  =  c,  ba=  1.  (43) 

Since  a  b  =  1  and  jcj  >  2,  we  may  further  assume  that  |a|  >  1,  and  |6|  <  1. 
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The  LDL^  decomposition  of  B  is 


B  =  [6,l,0]x{0,a,0]x[0,l,6]. 


B-l  =  [0,l,61-‘  X  [0,a,0]-'  X  [6.1,0]-i 


1  -6  62  .  (-6)"-^ 

1  -6  .  (-6)"-2 

1  -6 
1 


Let  d  =  (1, 0,  ■  •  • ,  0)^,  then 


-6  1 
62  -6  1 


=  . E  i’7(-»r‘)’' 


«=o  t=l 


i=n-l 


/  6  0  . 

0  0  . 

AB  =  ... 


0  0  . 


(l,0,---,0)  = 


B  =  B  +  AB  =  [l,c,l] 

Then,  by  the  matrix  modification  formula  (7),  the  solution  of  By  =  d  is 

y  =  B~U  =  {B  +  VE^)-U 

=  B-U-B-^V(I  +  E'^B-^V)~^E'^B-U 


where 


(I  +  E^B-^V) 


«  +  ‘E"=o‘='’ 


n-X  n-1 


«=0  i=l 
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f  \ 


^E'^B 


ELo 


«+i>Er=o&^‘ 


V  i-b) 


n-l 


The  last  element  of  y  is 


=  hg'  bZ7^b^'  _^{-br-^ 

o  a  a  +  &E7=o^^'  ® 

_  iz^f 


1 


«  V«  +  6Erro‘6^‘, 

(note  a  b  =  1). 


Thus: 


The  first  element  of  y  is 


yo 


\  1  +  ErJo  V 

_  ErjJ  b-^'  ( _ 1  \_  6(1-62") 

a  \l  +  ^^ErTj62‘7  l-62("+i) 

6(1  -  62") 


M  = 


<|6(. 


1 1  _  52(n+l)  i 

For  the  original  system  Ax  =  d,  A  =  A[l,  c,  1],  the  first  element  of  x  is 


(47) 

(48) 

(49) 


(50) 


xo  = 


Vo 
A  ■ 


(51) 


The  last  element  of  x  is 


Xn-i 


Vn-l 

A 


(52) 


Since  for  ToepUtz  tridiagonal  systems,  each  submatrix  Ai,t  =  0,  •  •  •  ,p  -  1,  has  the  same  structure 
as  >1,  we  have  the  following  lemma: 


Lemma  1  If  where  m  =  n/p,  is  less  than  machine  accuracy,  then  u^Li,  *  =  0,  ■  •  ^p-  1,  can 
be  replaced  by  zero  without  affecting  the  accuracy  of  the  final  solution  of  Ax  -  d. 


With  similar  arguments,  we  can  prove  that  for  d  =  (0,  •  •  -  0, 1)^,  .42  =  d  has  solution 


,  _  Vn-(<+l) 

'  r~ 


(53) 
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In  particular 


Combining  with  Lemma  1,  we  have; 


Xn-l  -  X 
20  -  -X- 


Theorem  1  //  ^  j-* ,  m  =  n/p,  is  less  than  machine  accuracy,  then  the  PDD  algorithm  gives  an 
approximation  to  the  true  solution  within  machine  accuracy. 

4.2  Accuracy  of  the  PDD  Algorithm 

Theorem  1  says  that  if  Vm-\,Wo  are  less  than  machine  accuracy,  the  PDD  algorithm  gives  a  sat¬ 
isfactory  solution.  In  most  scientific  applications,  the  accuracy  requirement  is  much  weaker  than 
machine  accuracy.  We  now  study  how  the  decay  rate  of  influences  the  accuracy  of  the 

final  solution.  Our  study  starts  at  the  matrix  partition  formula  (7). 

Let 

y  =  (I+E'^A-^V)-'^E'^A-U.  (54) 

Substitute  y  into  equation  (7),  we  have 

X  =  A~^d  -  A~^Vy 

E'^x  =  E^A-U-  E^A-W  -y  (55) 

=  (/  -b  E'^A-^V)y  ~  E^A-^V  •  y  =  y. 

Let  y*  be  the  corresponding  solution  of  the  PDD  algorithm, 


y*  =  (7  -f  A-'V  -  D)-^E^  A'^d, 

where  D  is  the  2(p  —  1)  x  2(p  —  1)  matrix  which  contains  ail  the  elements.  Combined 

with  Eq.(54)  we  have 


(7  +  E'^A~^V)y  -  (7  +  E’^A-^V  -  D)y*  =  0, 


That  is 


(y*  -y)  =  {I  +  E^A-^V  -  D)-^D  •  y. 


Let  X*  be  the  corresponding  final  solution  of  the  PDD  algorithm.  Then 

I-  =  A-U-  A-^V  y- 

X  -  x’  =  A~^V{y^  -  y) 

=  A-^V(I  +'E'^A-^V  -  D)-^D  •  y 
=  A-^V{I  -b  E'^A-^V  -  D)-^D  •  E^x. 
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Thus, 


(56) 


^  +  E'^A-^v  -  ny^DE^i 


The  inequality  (56)  holds  for  general  tridiagonal  systems.  In  the  following  we  assume  the  special 
structure  of  symmetric  Toeplitz  tridiagonaJ  system  to  compute  the  norm  of  its  right  side.  We  use 
the  l-norm  in  our  study.  As  discussed  in  the  last  section, 


/ 


-1 


(I  +  E^A~^V  -  D)-'^  = 


\ 


(57) 


where  Z,  are  2x2  matrices: 


Z-(  ^ 

"  Ji)  1 


(58) 


For  symmetric  Toeplitz  tridiagonal  systems  =  a,  and  =  Wq^  -  =  b, 

for  i  =  0, •  •  ’jP  -  1.  So,  for  our  applications, 


^  „  ,  1  d 

Zi  -  Zt  -  I  ,  , 

a  1 


1 


1  -d 


(59) 


(60) 


D-E^  stretchs  D  from  a  2(p- 1)  x2(p-l)  matrix  to  a  2(p-l)  x  n  matrix.  Each  column  of  D  is 
either  a  zero  column  or  contains  only  one  possible  non-zero  element,  b.  (/  -f  E^A~^V  -  D)~^DE^ 
is  a  2(p  -  1)  X  n  matrix.  Each  of  its  column  either  is  a  zero  column  or  contains  only  two  possible 
non-zero  elements  ci,C2,  where 


and 


Cl 

C2 

C2 

Cl 


W, 


(0 


0  /  -a 

1  -  d2  I  1 


=  Z-i  ( 


=  Z; 


-1 


b 

0 


1-0* 


(61) 


(62) 


For  our  application  Aj  =  Ai,  and  aj,'^  =  =  A,  t  =  0,  •••,?-  1.  So,  =  v,  z=  w,i  = 

0,  •  •  •,p  -  1  (see  Eq.  (15)).  {A~W){I  -f-  E^ A~^V  -  D)~^D  •  is  an  n  x  t?  matrix,  with  each 


1.5 


column  is  either  zero  column  or  contains  only  ciw,C2V  or  C2W,civ  respectively.  Thus, 


||i-‘F(/  +  E^A-^V  -  D)--^DET\\  <  max{\\c2v\\  +  HciHI,  Ikiull  +  Hc^tuli} 
<  Icjllltull  4-  killjull  (note  l|ti;||  =  ||t;||,  £’9. (53)) 

=  (kzl  +  IciDlli^ll  =  +  |a|)l|ull  =  j^\\v\\. 


From  our  results  given  in  Section  4.1, 


kl  = 


6(1  -6^"») 


A(1  -  62(m+l)) 


<  kl  < 


and 


We  have 


lull  = 
< 


1-6* 

1 

Z^i=0 

(i+i&H 

A(a-6*”>+‘) 

1-6* 

hil- 

l^IP  1 

-I6I-) 

A(a-6*'"+>) 

1  (i-fc’)(i-i 

.S'  1 


1_62(">+») 


d-LM") 


/  1  I-IM"*  1 
^  •  !-#+>  •  iqsl 


{note  |aj  >  1,  |6j  <  1) 
Combining  the  inequalities  (56)  and  (63)  we  obtain  the  final  results 


^  wk=T) 


(63) 


(64) 


(65) 


I  -  X 


< 


r&i 


IMi  -  kl)!  X  (|o|  - 1) 


^  |A^(l4al)(lal-l) 


|A| 


(66) 


(67) 


Inequality  (66)  shows  how  the  values  of  Vm~i  and  tuo  influence  the  accuracy  of  the  final  results. 
Inequality  (67)  gives  an  error  bound  of  the  FDD  algorithm.  When  jj  <  1,  inequality  (67)  can  be 
simplified  to 

llx  —  1*11 

< 


i&r 


Ml  -  |A|(|Al-|6|)(|a!-l) 


4,3  The  Accuracy  of  the  Reduced  PDD  Algorithm 

For  the  sake  of  writing,  in  this  section  and  next  section  we  assume  m  =  n/p  is  an  even  integer.  Let 
V  be  the  matrix  corresponding  to  V  in  Eq.(9)  such  that  A~^V  results  the  vectors  v,w  (see  Section 
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3.3),  and  let  x'  be  the  solution  of  the  reduced  PDD  algorithm.  Then 

x'  =  A-^d  ~  A-^V)E'^ A-^ d.  (68) 

As  in  Section  4.2,  we  let  y  =  {I  +  E^A~^V)E^A~^d.  Notice  that  x"  is  the  solution  of  the  PDD 
algorithm  (see  Section  4.2).  By  Eq.  (7)  and  (55), 

x'  -  X*  =  {A-^V  -  A~^V)y  =  (A-'^V  -  A~^V}E^x, 


Therefore, 


< 

< 


\(A-^V  -  A-^V)\\ •  IIE^II  =  \\A-^V  -  =  j|u  - 

1-1  l(-6)'(l-6°<"— )) 


1 

Aa 

6f 


Aa 


1-b^ 

1 

lo+|fcr-^D(l»l^(i-IM^))l 

l_62(m+l) 

i-iM' 

Hi!)  1 

(Hi!) 

-  nfen 


Equation  (69)  gives  the  accuracy  of  the  reduced  PDD  algorithm. 

Ik  -  <  Ik  -  g*ll  Ik*  -  icil 

Ikll  "  Ikil  Ikll 

<  _ kr _ ,  k"l 

kl(kl-|rSiS3rl)(kl-i) 


(69) 

(70) 


4.4  Anti-Symmetric  Toeplitz  Tridiagonal  Systems 

The  accuracy  analysis  given  by  Sections  4.1  -  4.3  is  for  symmetric  Toeplitz  tridiagonaJ  systems.  In 
this  section  we  extend  the  results  to  anti-symmetric  Toeplitz  tridiagonal  systems.  We  assume  that 
m  =  n/p  is  an  even  number. 

An  anti- symmetric  Toeplitz  tridiagonal  matrix  A  has  the  form  A  =  [— A,^,A)  =  A  •  [-l,c,  1]. 
Let  B  =  [-l,c,  1].  Then,  for  the  corresponding  matrix  B  (see  Section  4.1) 


B  =  [6,1,0]  X  [0,0,1]  X  [0,1, -6], 


where  o,  b  are  the  solutions  of 

b  +  c^c,  b-a=-l.  (71) 

Comparing  with  symmetric  case,  the  only  difference  are  -b  in  matrix  [0, 1,  -6]  and  b  a  =  -1  in  Eq. 
(71).  Following  the  steps  given  in  the  study  of  symmetric  systems,  we  have  compiitod  the  vectors 
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of  V  and  w  in  Eq.  (15), 


*  ('  “  (SK‘ ( - 1  )■»"••  ES' ( - 1  -i)“- ‘ 


(72) 


w  ~ 


M«+4E[io 


We  can  see  for  anti-symmetric  Toeplitz  tridiagonal  systems  Vq^  =  and  = 

~Wq^  =  =  6,  for  t  =  0, •  •  -  .p  -  1.  Thus,  the  inequality  (63)  remains  true  for  anti-symmetric 

cases. 

By  Eq.  (72),  we  have 


x  =  „(0  (-i)"-'  i-br-(l  +  b-‘) 


and 


(») 

a  =  vy  = 


,1,  ,  |6"'(l  +  67)| 

"  - - iAi - ’ 

Mo  +  +  62(”‘+i))’ 


a  = 


-6»(l-6^"*) 


A(1  +  &2(-n-H)) 


l&l 


For  the  bound  of  the  norm  of  vector  u  (see  Eq.  (65)).  When  b'  a  —  -1, 


INI  -  |^C+6E::f(-i)-^0li 


0+ 


-  |Aa(l+S*('»+»))|  xqjl  S  |)k(|o|-l)r 


The  corresponding  relative  error 


in  terms  of  o  and  5;  and 


lk-a?1l  < 
INI!  “ 


fb\ 

llxll  -  !A(l-lal)(|al-l)l 


\br(l  +  b^)  1 

5r(i + &*) 

|A’(1  -  |o|)(|a|- 1)1  |A(|A|- 

6{l_62m) 

I(|ai-l)l 

l+42(m+l) 

(73) 


(74) 
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System 

Matrix 

Best 

sequential 

the  PDD 

Computation 

Communication 

m 

Non-periodic 

8n-7 

2a  -+■  12/1 

Periodic 

14n-16 

17^-4 

Multiple 

right-side 

Non-periodic 

(5ti  -  3)  ♦  nl 

(2q  +  8/3)  ♦  nl 

Periodic 

(7n  —  1)  +  nl 

Table  1.  Computation  and  Communication  Counts  of  the  PDD  Algorithm 


in  terms  of  a  and  b.  When  fi  <  1,  we  have 


|{x-z-||  |6r(l  +  6^) 

IWl  -  |A(lA|-|6|)(|a|-l)l 

For  the  reduced  PDD  algorithm,  when  the  system  is  anti-symmetric,  we  have 


and 


\v 


wll 


li  14-6^ 

|A  ■  l  +  62(m+l) 

lAl(lal-l)- 


Ha:- jII 
11*11  - 


lA(tAl- 


i6r(i-n>^) 


+ 


|6|m/2 

lAKlal-l)- 


5  Experimental  Results 

Table  1  gives  the  computation  and  communication  count  of  the  PDD  algorithm.  Since  the  tridi¬ 
agonal  systems  arising  in  both  ADI  and  in  the  compact  scheme  method  are  multiple  right-side 
systems,  the  computation  and  communication  count  of  solving  multiple  right-side  systems  is  also 
listed  in  Table  1,  where  the  factorization  of  matrix  A  is  not  considered  and  nl  is  the  number  of 
right-sides.  Note  for  multiple  right-side  systems,  the  communication  cost  increases  with  the  num¬ 
ber  of  right-sides.  Table  2  gives  the  computation  and  communication  counts  of  the  reduced  PDD 
algorithm.  As  the  PDD  algorithm,  it  has  the  same  parallel  computation  and  communication  counts 
for  periodic  and  non-periodic  systems, 

A  sample  matrix  is  chosen  to  illustrate  and  verify  the  algorithm  and  theoretical  results  given 
in  previous  sections.  The  sample  matrix  A  is  a  resulting  matrix  of  the  compact  scheme, 

=  (75) 
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System 

the  Reduced  FDD 

Computation 

Communication 

Single  system 

15^-4 

2q  -I-  12/? 

Multiple  right-side 

(7^-M)*nl 

(2a  -1-  8/?)  *  nl 

Table  2.  Computation  and  Communication  Counts  of  the  Reduced  FDD 


For  matrix  A, 


^  5^  =  5  •  [1*  3, 1]  =  i  •  ([6, 1, 0]  X  [0,  a,  0]  X  [0, 1, 6]  -  AB) , 


^3’  ’3^  3 

where  AB  is  given  by  Eq.(44),  and 


X  _  1  -  _  o  -  3+ V5  ,  3- 

3  ’  2  ’  2 


(76) 


Figure  2.  Measured  and  Predicted  Decay  Rate. 

The  FDD  algorithm  was  first  implemented  on  a  Xllr4  terminal  to  solve  the  corresponding 
periodic  system  of  Ax  =  d  for  accuracy  checking.  Then  the  algorithm  was  implemented  on  a 
32-node  Intel/860  to  measure  the  speedup  over  Thomas  algorithm  [7],  a  commonly  used  practical 
sequential  algorithm  for  periodic  tridiagonal  systems.  For  accuracy  checking,  all  the  measured  and 
predicted  data  have  been  converted  by  « logarithm  function  with  base  ten  to  make  the  difference 
visible.  Figure  2  depicts  the  decay  rate  of  Vm-\  of  matrix  A,  where  the  x-coordinate  is  the  order 
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of  the  sub-system  A,  and  the  y-coordinate  is  the  value  of  v^-i-  We  can  see  that  the  theoretical 
bound  given  in  Section  4.1  coincides  with  the  measured  value. 


Figure  3.  Measured  and  Predicted  Accuracy  of  the  PDD  Algorithm. 

Accuracy  comparisons  of  the  PDD  and  the  reduced  PDD  algorithms  are  given  in  Fig.  3  and  Fig. 
4  respectively.  For  the  accuracy  comparisons,  the  right-iide  vector,  d,  was  randomly  generated. 
The  x-coordinate  is  the  order  of  matrix  A,  and  the  y-coordinate  is  the  relative  error  in  the  1-norm. 
These  two  figures  show  that  our  accuracy  analysis  provides  a  very  good  bound. 

Figure  5  and  6  give  the  speedup  of  the  PDD  algorithm  over  Thomas  algorithm.  For  single 
system,  the  order  of  matrix  A  is  limited  by  the  machine  memory  for  n  =  6400.  For  multiple  right- 
sides,  the  system  is  limited  for  n  =  128  and  nl  =  4096.  From  Fig.  5  we  can  see  that  the  speedup 
of  solving  a  single  system  increases  linearly  with  the  number  of  processors.  Figure  6  shows  that 
the  linear  increasing  property  does  not  hold  for  multiple  right-side  systems.  The  lower  speedup  is 
due  to  the  increase  of  communication  cost.  Since  the  Intel/860  has  a  very  high  (communication 
speed)/(computation  speed)  ratio,  we  can  expected  a  better  speedup  on  an  Intel  Paragon  or  even 
on  an  Intel/iPSC2  [18]  multicomputer. 

6  Conclusion 

A  detailed  study  has  been  given  for  the  efficient  tridiagonal  solver,  the  Parallel  Diagonal  Dominant 
(PDD)  algorithm.  The  presented  PDD  algorithm  is  slightly  different  from  the  originally  proposed 
version  [21]  and  is  also  extended  to  periodic  systems.  A  variant,  the  reduced  PDD  algorithm,  was 
also  introduced.  Accuracy  analysis  is  provided  for  a  class  of  tridiagonal  systems,  the  symmetric 
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Order  of  Matrix 


Figure  4.  Measured  and  Predicted  Accuracy  of  the  Reduced  PDD 


Figure  5.  Measured  Speedup  Over  Thomas  Algorithm. 
Single  System  of  Order  $400 


Speedup 


Figure  6.  Measured  Speedup  Over  Thomas  Algorithm. 
4096  Systems  of  Order  128,  Factorization  Time  Not  Included 


and  anti-symmetric  Toeplitz  tridiagonal  systems.  Implementation  results  were  provided  for  both 
accuracy  analysis  and  for  the  proposed  algorithm.  They  showed  that  the  accuracy  analysis  provides 
a  very  good  theoretical  bound  and  that  the  algorithm  is  highly  efficient  for  both  single  and  multiple 
right-side  systems.  The  algorithm  is  a  good  candidate  for  large  scrde  computing,  where  the  number 
of  processors  and  the  problem  size  are  large.  It  is  a  good  choice  for  the  newly  emerged  massively 
parallel  machines,  such  as  Thinking  Machine  Corporations’s  CM-5  and  Intel’s  Paragon.  The  discus¬ 
sion  is  based  on  distributed-memory  machines.  The  residt  can  be  easily  applied  to  shared-memory 
machines  as  well. 

The  FDD  algorithm  and  the  reduced  FDD  algorithm  proposed  in  this  paper  can  be  extended 
to  band  systems  and  block  tridiagonal  systems.  The  accuracy  analysis,  which  gives  a  good,  simple 
relative  error  bound,  is  for  symmetric  and  anti-symmetric  Toeplitz  tridiagonal  systems  only.  It  is 
unlikely  that  the  analysis  can  be  extended  for  general  case  with  the  same  technique. 
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